Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still needs
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, a ubiquitous,
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is conspicuously absent.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions.
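The second idea above, metadata that travels inside the dataset and is queried through the same access interface, can be sketched in a few lines of Python. The container and method names here (`SelfDescribingDataset`, `query_attrs`) are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: metadata embedded in the dataset object itself, so
# metadata-oblivious code can pass the dataset along unchanged while
# metadata-aware code queries it through the same access interface.
# All names are illustrative, not from the paper.

class SelfDescribingDataset:
    def __init__(self, values, attrs=None):
        self.values = list(values)        # the scientific data itself
        self.attrs = dict(attrs or {})    # rich metadata stored alongside

    def read(self, start, stop):
        """Standard data access call; embedded metadata rides along."""
        return self.values[start:stop]

    def query_attrs(self, predicate):
        """Query metadata through the same object, with no side channel."""
        return {k: v for k, v in self.attrs.items() if predicate(k, v)}

# Automatic capture at production time: the producer records provenance
# without a manual annotation step by the domain scientist.
ds = SelfDescribingDataset(
    values=[0.1, 0.5, 0.9],
    attrs={"producer": "simulation_v2", "units": "K", "timestep": 42},
)

print(ds.read(0, 2))                              # → [0.1, 0.5]
print(ds.query_attrs(lambda k, v: k == "units"))  # → {'units': 'K'}
```

Because the metadata lives in the dataset rather than in an external catalog, any process that copies or moves the dataset preserves it for free, which is the "permeating metadata-oblivious processes" property the vision calls for.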
ExaHDF5: Delivering Efficient Parallel I/O on Exascale Computing Systems
Scientific applications at exascale generate and analyze massive amounts of data. A critical requirement of these applications is the capability to access and manage this data efficiently on exascale systems. Parallel I/O, the key technology that enables moving data between compute nodes and storage, faces monumental challenges from the new applications, memory, and storage architectures considered in the designs of exascale systems. As the storage hierarchy expands to include node-local persistent memory, burst buffers, etc., as well as disk-based storage, data movement among these layers must be efficient. Parallel I/O libraries of the future should be capable of handling file sizes of many terabytes and beyond. In this paper, we describe new capabilities we have developed in Hierarchical Data Format version 5 (HDF5), the most popular parallel I/O library for scientific applications and one of the most used libraries at the leadership computing facilities for performing parallel I/O on existing HPC systems. The state-of-the-art features we describe include: the Virtual Object Layer (VOL), Data Elevator, asynchronous I/O, full-featured single-writer/multiple-reader (Full SWMR), and parallel querying. We introduce these features, their implementations, and their performance and feature benefits to applications and other libraries.
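The Virtual Object Layer idea, routing the same read call to interchangeable storage backends, can be sketched in plain Python. The class and method names below are illustrative assumptions; HDF5's real VOL is a C plugin interface, not this API:

```python
# Illustrative sketch of the Virtual Object Layer (VOL) concept: the
# application issues one generic read call, and a pluggable backend
# decides where the bytes actually come from (disk file, burst buffer,
# node-local memory, ...). All names here are hypothetical.

class InMemoryBackend:
    """Stands in for, e.g., a node-local burst-buffer connector."""
    def __init__(self, store):
        self.store = store

    def read(self, name):
        return self.store[name]

class VirtualObjectLayer:
    def __init__(self, backend):
        self.backend = backend          # swappable without touching callers

    def dataset_read(self, name):
        """The one call applications see, regardless of storage tier."""
        return self.backend.read(name)

vol = VirtualObjectLayer(InMemoryBackend({"temperature": [291.3, 292.1]}))
print(vol.dataset_read("temperature"))   # → [291.3, 292.1]
```

The design point is that applications code against `dataset_read` once; moving data to a faster tier means swapping the backend object, not rewriting the analysis code.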
Detecting Atmospheric Rivers in Large Climate Datasets
Extreme precipitation events on the western coast of North America are often traced to an unusual weather phenomenon known as atmospheric rivers. Although these storms may provide a significant fraction of the total water to the highly managed western US hydrological system, the resulting intense weather poses severe risks to the human and natural infrastructure through severe flooding and wind damage. To aid the understanding of this phenomenon, we have developed an efficient detection algorithm suitable for analyzing large amounts of data. In addition to detecting actual events in the recent observed historical record, this detection algorithm can be applied to global climate model output providing a new model validation methodology. Comparing the statistical behavior of simulated atmospheric river events in models to observations will enhance confidence in projections of future extreme storms.
Our detection algorithm is based on a thresholding condition on the total column integrated water vapor established by Ralph et al. (2004), followed by a connected component labeling procedure to group the mesh points into connected regions in space. We develop an efficient parallel implementation of the algorithm and demonstrate good weak and strong scaling. We process a 30-year simulation output on 10,000 cores in under 3 seconds.
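The two-stage pipeline described above, threshold the integrated water vapor (IWV) field, then group above-threshold mesh points into connected regions, can be sketched serially. The toy grid and the 20 kg/m² cutoff are illustrative stand-ins, and this single-core flood fill stands in for the paper's parallel connected-component labeling:

```python
# Sketch of the detection pipeline: threshold the IWV field, then label
# 4-connected regions of above-threshold grid points. Toy data and a
# serial flood fill; the paper's implementation is parallel.
from collections import deque

def label_regions(field, threshold):
    rows, cols = len(field), len(field[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if field[r][c] > threshold and labels[r][c] == 0:
                count += 1                       # start a new region
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:                     # 4-connected flood fill
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and field[ny][nx] > threshold
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

iwv = [[10, 25, 26,  5],        # toy IWV grid, kg/m^2
       [ 8, 24,  9,  5],
       [ 5,  5,  5, 30]]
labels, count = label_regions(iwv, threshold=20)
print(count)    # → 2 distinct above-threshold regions
```

A real detector would add the geometric filters (length, width, orientation) that distinguish an atmospheric river from any moist patch; this sketch shows only the thresholding and labeling core.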
Expediting Scientific Data Analysis with Reorganization of Data
Data producers typically optimize the layout of data files to minimize the write time. In most cases, data analysis tasks read these files in access patterns different from the write patterns, causing poor read performance. In this paper, we introduce Scientific Data Services (SDS), a framework for bridging the performance gap between writing and reading scientific data. SDS reorganizes data to match the read patterns of analysis tasks and enables transparent data reads from the reorganized data. We implemented an HDF5 Virtual Object Layer (VOL) plugin to redirect the HDF5 dataset read calls to the reorganized data. To demonstrate the effectiveness of SDS, we applied two parallel data organization techniques: a sort-based organization on plasma physics data and a transpose-based organization on mass spectrometry imaging data. We also extended the HDF5 data access API to allow selection of data based on their values through a query interface, called SDS Query. We evaluated the execution time in accessing various subsets of data through the existing HDF5 Read API and SDS Query. We showed that reading the reorganized data using SDS is up to 55X faster than reading the original data.
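Why sort-based reorganization helps value-based queries can be shown with a pure-Python toy in the spirit of SDS Query (the function names are illustrative; the real SDS operates on parallel HDF5 files, not Python lists): a range query over the write-optimized layout must scan everything, while the reorganized sorted copy admits binary search.

```python
# Toy illustration of sort-based reorganization for value-range queries.
# 'original' mimics a write-optimized layout; 'reorganized' is the
# read-optimized sorted copy that a framework like SDS would maintain.
import bisect

original = [7.2, 0.4, 9.9, 3.3, 5.1, 1.8]    # write-optimized order
reorganized = sorted(original)               # read-optimized copy

def range_query_scan(data, lo, hi):
    """Full scan: what a query on the original layout must do, O(n)."""
    return [x for x in data if lo <= x <= hi]

def range_query_sorted(data, lo, hi):
    """Binary search on the sorted copy: O(log n) to locate the bounds."""
    left = bisect.bisect_left(data, lo)
    right = bisect.bisect_right(data, hi)
    return data[left:right]

print(range_query_sorted(reorganized, 1.0, 6.0))   # → [1.8, 3.3, 5.1]
```

The transparency claim in the abstract corresponds to hiding this redirection behind the unchanged read API: the VOL plugin decides whether a call is served from the original or the reorganized copy, so analysis code never changes.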
Parallel I/O, Analysis, and Visualization of a Trillion Particle Simulation